Phylogentic trees. Everyone loves them. Particularly in the context of an outbreak they are great to look at the genomic relatedness (or not) between isolates collected from both clinical cases and environmental swabs.
Generally trees have been put together by bioinformaticians and scientists working in whole genome sequencing (WGS) laboratories. However, in their raw form they can be challenging to correctly interpret and infer relevant information. WGS alone is not a silver bullet regarding outbreak investigations. WGS data, and by extension phylogentic visulisation of this data, depends on, and is greatly enhanced by, contextual epidemiological information.
As such, there is a greater need for epidemiologists to be able to work with tree data to effectively communicate to stakeholders. Whether this be the progress of an outbreak, identifying suspected causative agents based on genomic relatedness, visualising contextual information for a pathogen for surveillance purposes, and many other reasons. There are a number of tools available to both view, manipulate and enhance tree data.
In this Lesson From the Field you will use the Interactive Tree of Life (iTOL - https://itol.embl.de/) to do some very quick visulisation, inspection and manipulation of trees. It is also possible to do annotations and apply metadata but this will be outside the scope of this LFF.
You will then take a subset of the tree looked at in iTOL and import it into an Rmarkdown document to do some visualisations. This will also serve as a bit of an overview of Rmarkdown and its utility both for analysis developement as well as documenting your work and being able to produce reports from this.
R code to read in, manipulate and produce
phylogentic trees
R code to customise outputsrmarkdown and the
pacman approach to package managementThe majority of this LFF will be following along this webpage. Later you will run, edit, and write some R code in the provided .rmd file. Some introductory resources to this have also been provided in the following sections.
The files can either be found within the LFF folder. Or you can download them by clicking the links below (if they open as a text web page, right click and choose “save as”). This webpage is actually an Rmarkdown document hosted online. If you want to see the actual file, you can find it in the github repo this is hosted on - this is left as an exercise to the reader.
Later in this exercise we will create a subtree from the big tree. The link below is the produced subtree if you are unable to create it and get stuck.
Some minor set up will be required. This will also serve as a brief
introduction to rmarkdown as well as the
pacman package, which can serve as a relatively
straightforward way to facilitate a degree code transferability.
You have “r file to be entered” but before that we do some set up in
the R console.
Open RStudio
First we will install the pacman package (https://cran.r-project.org/package=pacman). pacman is
basically a wrapper for a lot of package
management tasks. The most useful one I have found is
the p_load function, which says to load these libraries,
and if they are not installed, to install them. If you work in a way
where everybody uses pacman then you don’t have to worry
about what packages you need to install for someones script as they will
happen automatically when you run it.
The only requirement is that you have this package installed yourself first and that you need to load it first in your code as normal.
Run the following in your RStudio command line.
install.packages("pacman")
Now you can have a convention that when you write a script you start with your code like the below example:
library(pacman) # load the pacman package
# The following then loads (and installs if missing) the following packages.
# Having each package on a line like this allows for commenting to explain what/why each package is used
p_load(tidyverse,
rio, # package for easy import/export of data
treeio, # package for input/output tree data
ggtree) # visualising trees with syntax like ggplot (included in tidyverse)
While we are here, lets check we have installed / updated the
packages needed for rmarkdown and
knitting.
Run the following in your RStudio command line.
install.packages(c("rmarkdown", "knitr"))
We are ready to rock! We will actually not be starting in
R but good on you for getting everything set up!
Firstly we aren’t going to start in RStudio!
The Interactive Tree of Life (iTOL) is a free web based interface to do phylogentic tree investigations and visulisations.
It is very powerful as a tool itself and can also overlay metadata but For the purposes of this lesson we will keep this simple to demonstrate the utility of iTOL for doing a quick look at a tree.
Open the iTOL homepage (https://itol.embl.de/)
You can create an account if you wish to save trees but this is not needed. Click on the Upload button in the top bar.
Have a read of the page. There are a couple of options when uploading the tree. You can either Browse to your tree file or drag and drop it directly into the window.
You don’t need to worry about the different tree file formats. In my experience, Newick is the most predominant and what we will be using.
Note: There is some flexibility in the format of the
filename extension that is uploaded. All tree files are
basically a text file. iTOL (and R packages) will accept a
few filetypes - it is what is inside that matters. Some tree files with
be filename.txt or some will be
filename.tree. This doesn’t matter to us.
Choose either method and upload
Shigella_tree.txt.
The page will autoload if you drag and drop, otherwise press the upload button if you browsed and selected.
You should now see the default view of a large tree. iTOL will default to a circle tree as this tends to be more appropriate for larger datasets. Smaller ones will default to a rectangular tree.
You can zoom in with a mouse scroll wheel and click and drag the tree around. Hovering your mouse over labels will provide some technical information (or metadata if present).
In the top right of the page you will see a control panel for the tree visualisation.
Try some of the different options in the Basic tab.
We won’t be looking at the Advanced or Datasets tab.
Most options in the Basic should be reasonably self explanatory and will control things like the size, font, position and alignment of the labels.
Convention and preferences may vary but many seem to prefer Label options -> Position -> At tips. So that it can be easier to get a sense of branches and where labels match up on trees. Often though, if you zoom in to parts of large trees with many leaves when doing this they may overlapp. You can change the Font Style to a smaller number (by defaulted measured in pixels ~ px) to see clearer.
Size of 10px (left) vs 4px (right) - note the labels will be different on your screen as they were changed to more easily generate and match epid data later.
iTOL can also be useful to easily prune a tree to a subset that you are interested in. For example, you have a large dataset of SARS-CoV-2 sequences containing a couple of sublineages. You want to further analyse one lineage or an outbreak sitting on its own branch.
Prunning allows you to select a node / clade and all of the leaves contained within it.
Lets prune our large tree to just look at the isolates that (depending on if you changed the rotation or arc of the visualisation) is on the upper right part of the circle tree. Hovering over the long node/branch it should indicate that this contains 44 leaves.
Clicking this will show a menu for Clade functions.
Note: You can also click on labels to get Leaf functions in a similar way. This can often be useful to Copy node ID if you need to further examine a particular isolate in an associated linelisting or to copy this ID to highlight it through functions.
Press Add clade to pruned tree
This will highlight everything below in green and pop up a box with a summary of your pruning selection. You can add other branches if you need to include other clades or groups of isolates for analysis.
Press Prune tree
Now you should have a greatly reduced tree that likely looks very strange presented as a circle. Use the control panel to fix this.
Use the control panel to change the Mode -> Rectangular.
Presented now should be a style of tree you may have seen before. Similar you can add customisation here and prune further if need be but we will now use this to export this smaller subtree as a file.
The final tab on the Control panel is the Export
tab. Using this we can export the current tree. We will do this with
this pruned tree to use this smaller tree in some R
visualisations.
Press Export tab on the Control panel
The default format is SVG (a graphics picture format).
Change this to Newick tree under Text as below.
Leave the other options at their default. For the Filename
we will choose a name to go with the R file we will work
on.
Set the file name option to subtree and press Export
This should automatically prompt you to save the file somewhere or it will download automatically to your “Downloads” folder (or wherever you set this) depending on your browser settings.
!!IMPORTANT!! Make sure you save/move this file to
the same folder as the LFF_response.rmd file.
This concludes the iTOL section and now we will move to
RStudio.
Move to your RStudio session and open the
LFF_response.rmd file to work in and write your responses
into.
You may get a notice or banner at the top to install some necessary
packages such as knitr or rmarkdown, please
press install if asked.
The file you are working in is an Rmarkdown document. You may notice
instead of every line being code with some comments like an
R script. This one allows you to write a report and the
code is separated into chunks.
Firs thing you may notice is the block of code at the top. This header is basically the instructions to RStudio on how to compile the document and some basic information.
Edit your name (within the quote marks) into the header
---
title: "LFF: Visualising Phylogentic Trees"
author: "Your Name Here" # Edit your name into here
date: "2023-04-03"
output: html_document
---
The output option says what format to compile the document into with
the most common being html_document,
pdf_document and word_document. You don’t need
to edit these in. When you choose to Knit the document,
these options will be automatically added.
This document you are reading now is a rendered html document, with a few more options added (such as to get the table of contents).
Rmarkdown is an excellent way to blend your analysis with associated notes or documentation. Think of it like your data analysis lab journal. It allows you to perform chunks of analysis and rapily run and rerun chunks of code.
A chunk of R code in an .rmd document
Yes. The technical term for these blocks of code is a chunk.
As above you can see that the area of code is defined by three back ticks ```{r} followed by your code and closed with another three backticks ```.
This allows you to just run the code within that section by pressing
the green play arrow in the top right corner. The code is executed and
the results are displayed in line within the document itself instead of
in the R console. This is also nice to be able to go
between parts of your analysis and keep results visible without needing
to rerun a potentially lengthy script.
The button in the middle with the downarrow and green line will run all code chunks above this chunk. This is very useful to run all earlier datacleaning or analysis that the chunk you are working in depends on. Either for when you come back to your document or to “refresh” your data to run your current chunk newly again.
The left button in this corner allows you to set chunk options, but you will be very unlikely to need this.
The top right of your overall .rmd document window has the Run option that pressing on also gives options about running chunks or individual lines of code.
You can either copy paste a chunk set up or type manually the text to create a code chunk. Alternatively, you can use the keyboard shortcut: ctrl + alt + i. You can try this in LFF_response.rmd.
There will be some questions for you to answer during this part of the exercise. This will be signified in the response document by the following:
<div class="response">
Type your response in here.
</div>
This is so your responses will stand out when looking at your final
document. You can go over multiple lines, just make sure the first line
has the <div class="response"> and the final line has
</div>.
From here, much of the information may be similar or duplicated between this document and LFF_response.rmd. As you will run the code in there and follow along with the following outputs to get the same results!
Follow along below and in your response document
Lets import the subtree we created in the iTOL section. At a basic
import. It doesn’t look like much when using the base R
plotting as well as a basic gg plot from
ggtree.
tree <- read.tree("subtree.txt") # Import our tree file
plot(tree) # Default plotting using base R
ggtree(tree) # Default plotting using ggtree
ggtree(tree, layout = "circular") # By default ggtree will plot rectangular, use this to plot circular
You will also notice that when you run a chunk with multiple plots like this, they will tab themselves within the document. In this webpage they show up separately with each code line between as when documents are knitted, outputs are presented as they go.
You can also press the left most button in the top right corner to show this in a new window
Answer the following question in your response document.
What are the differences between each of the tree in
Rand with the visualisation we had in iTOL?
The default ggtree does not include much since it
requires you to be more explicit about what you want from it. The
gg syntax can be summarised as your base data for your plot
and then you add on (with a +) other visual pieces (geom’s: geom ~
geomtry) to it.
p <- ggtree(tree) + geom_tiplab(size = 2) # add on tip labels to your plot with the given size
p # since the above line creates the 'p' object, we need to print it by writing it here
Note: When asigning a plot to a variable
(p above), we need to then state it so it prints into the
document. Because we are working in an rmarkdown document, you will
notice it doesn’t show up in the usual RStudio plot window.
To do this, you can just type p enter into your
R console since this is stored in your environment. You can
try it now and see the plot show up in your plot window.
First we import our data. This is very straightforward as we are
using the packed rio (for R Input/Output). It will detect
the file type and import appropriate (most of the time).
epidata <- import("subtree_epidata.csv")
Simple!
Lets have a look at what is in our epidata. You can do this by just
clicking on the object in the top right panel of RStudio.
Click on epidata and it will view it in the main
RStudio panel.
| Sample_ID | Residence | Travel | Resistance | Sex |
|---|---|---|---|---|
| Sample_178 | NSW | International | Drug A | Female |
| Sample_179 | NSW | International | Drug A | Female |
| Sample_180 | VIC | International | Drug A | Male |
| Sample_181 | VIC | Unknown | ||
| Sample_182 | VIC | Unknown | ||
| Sample_183 | ACT | Unknown | ||
| Sample_184 | ACT | State | Drug C | Male |
| Sample_185 | ACT | Drug C | Male | |
| Sample_186 | ACT | Male | ||
| Sample_187 | VIC | Male | ||
| Sample_188 | VIC | Male | ||
| Sample_189 | VIC | Male | ||
| Sample_190 | VIC | Drug C | Male | |
| Sample_191 | QLD | Female | ||
| Sample_192 | QLD | Female | ||
| Sample_193 | QLD | Male | ||
| Sample_194 | QLD | State | Male | |
| Sample_195 | QLD | Male | ||
| Sample_196 | QLD | Female | ||
| Sample_197 | NSW | Drug B | Unknown | |
| Sample_198 | NSW | Drug B | Female | |
| Sample_199 | NSW | Drug B | Female | |
| Sample_200 | NSW | Drug B | Female | |
| Sample_201 | NSW | State | Drug B | Male |
| Sample_202 | NSW | State | Drug B | Female |
| Sample_203 | NSW | Drug B | Female | |
| Sample_204 | NSW | Drug B | Male | |
| Sample_205 | NSW | Drug B | Male | |
| Sample_206 | QLD | Drug A | Female | |
| Sample_207 | QLD | Drug A | Female | |
| Sample_208 | QLD | Drug A | Male | |
| Sample_209 | QLD | Male | ||
| Sample_210 | VIC | State | Unknown | |
| Sample_211 | Laos | Female | ||
| Sample_212 | Laos | Female | ||
| Sample_213 | Laos | Female | ||
| Sample_214 | Laos | Female | ||
| Sample_215 | Vietnam | Male | ||
| Sample_216 | Vietnam | Drug C | Male | |
| Sample_217 | Vietnam | Female | ||
| Sample_218 | Vietnam | Male | ||
| Sample_219 | Vietnam | Drug C | Male | |
| Sample_220 | Vietnam | Female | ||
| Sample_221 | NSW | State | Unknown |
We need to attach epi or meta data to our tree. We do this using the
attacher operator: %<+%
Yes, it is a bit of a weird one. The general syntax is as follows
plotobject %<+% data
You can think of this like an assignment (<-)
combined with the + because you are adding data in.
Important! The first column of your epidata needs
contain the same labels as present in your tree object so that
R knows which sequence is associated with which epidata
entry.
This is how we do it here
p <- ggtree(tree) %<+% epidata #attach our epidata to the tree object and reassign (update it)
Now the fun stuff!
We see that we have a few bits of data with the most complete being
Residence. Our data set seems to be a bit odd in terms of
location but it must make for an interesting group then.
Like we discussed before, we add the tip label to the plot
object. This time, specifying that we want to fill based on
Residence and apply this to the label geometry
of our tree. This looks something like this:
p_res <- p + geom_tiplab(aes(fill = Residence), geom = "label", size = 2)
p_res
Looking at this we can see some groupings and potential clades of samples based on the tree structure and the regions.
We can highlight clades of interest. This is done based on the internal node reference structure in the tree. The easiest way to identify these is to add the nodes to our plot for visual inspection. We do this by adding some text geometry and specify that we want to see the node labels and colour them so they are visible.
p_node <- p_res + geom_text(aes(label = node), size = 5, colour = "darkred")
p_node
On your response document you will definitely want to launch this in
a new window or load p_node in your console and zoom
in.
These node numbers are generated as a part of the tree file and can be referred to for colouring, rotating and general manipulation of a tree.
We can kind of see that there are a few distinct groupings and the
parent node of these are 79 (arguably you could split this
between the two states and Sample_210 is in an interesting
position but we will keep this simple), 70 and
60.
Now we have identified these, we can apply some highlighting to these regions.
You will also see below that it can be helpful for managing your
code, and so it is readable, to write over multiple lines. You can
continue lines of code in things like plots by ending the line with a
+ and entering to the line below.
p_highlight <- p_res +
geom_highlight(node = 79, fill = "yellow") +
geom_highlight(node = 70, fill = "lightblue")
p_highlight
Hmm, that doesn’t cover all the way across. We can manually extend the region, which does become a bit of an exercise in tuning these numbers. Which is why it is very helpful to be able to rerun code in a chunk!
p_highlight <- p_res +
geom_highlight(node = 79, fill = "yellow", extend = 0.01) +
geom_highlight(node = 70, fill = "lightblue", extend = 0.015)
p_highlight
Your turn to add the final group at node 60.
In your response document, add the extra line of code, with extend if need be, for node
60
Lets go back to our clean tree and show another way to add epidata.
We can also add a variable to the end of the leaves of our tree. Lets
do that for the Sex variable. This is using the geometry
geom_tippoint as we are colouring in the points of each tip
of the tree.
p_tip <- p + geom_tiplab(size = 2) + # To add the basic sampe_id back
geom_tippoint(aes(color = Sex), size = 2)
p_tip
Heatmaps can get reasonably complex when trying to add them to plots.
In the R world there seem to be a few options with some
newer packages being developed and actively worked on.
This is included just for reference and peoples interest. You can copy this code and run it but this is not expected.
This is challenging to automate as a part of regular reporting. These plots requires tuning of parameters to look nice and format properly when working on just one set of outbreak data (ask me how I know that~~~~~~)
p_load(ggnewscale) # We need another library for compilation
# Data preparation
epi_travel <- data.frame("Travel" = epidata$Travel)
rownames(epi_travel) <- epidata$Sample_ID
epi_drugres <- data.frame("DrugRes" = epidata$Resistance)
rownames(epi_drugres) <- epidata$Sample_ID
# Then we need to go through and add the heatmap layers iteratively
# First the travel information
p_heat <- gheatmap(p_res + new_scale_fill(),
epi_travel,
colnames = FALSE,
width = 0.03,
offset = 0.005) +
scale_fill_manual(name = "Travel",
breaks = c("International", "State"),
values = c("red", "blue"),
na.value = "white")
# Now some drug resistance information
p_heat <- gheatmap(p_heat + new_scale_fill(),
epi_drugres,
colnames = FALSE,
width = 0.03,
offset = 0.01) +
scale_fill_manual(name = "Drug Resistance",
breaks = c("Drug A", "Drug B", "Drug C"),
values = c("green", "purple", "orange"),
na.value = "white")
p_heat
The final section in the response document will generate a randomly shaped tree and random dataset for you to use to create a plot based on what you have learned during this session.
Below is a code chunk to generate a random tree and a random dataset.
set.seed is there to make the random chance repeatable.
Make up a number yourself and replace the 1234 with it.
This means that when you submit this to me, I will see the results you
worked with, but everyone in our group will generate a different tree
when they use a different number.
You can check this against the same seed in your response document to see. Then change the seed to a different number you make up.
set.seed(1234)
size = 10
random_tree <- rtree(size)
random_data <- data.frame("ID" = random_tree$tip.label,
"Location" = sample(c("QLD","NSW","VIC"), size, replace = TRUE),
"Sex" = sample(c("Male", "Female"), size, replace = TRUE),
"Travel" = sample(c("Overseas", ""), size, replace = TRUE, prob = c(0.2, 0.8)))
r <- ggtree(random_tree) + geom_tiplab()
r
I’m interested to see what you can make!
To submit your response to me you can either email the
LFF_response.rmd document and/or a compiled html page that
you can knit by pressing the Knit option at the top of your
main RStudio window.
This will generate a .html page in the folder you were working in. Email that to me.
That’s it. You are done!